Thank you very much for this extensive introduction. I'm very happy to be here today, and I would like to talk about some of the challenges that I, as a researcher, face on a day-to-day basis when dealing with artificial intelligence.
Namely, I want to talk about how biases and shortcuts impact the models that we train
in this area. As Professor Honegger nicely introduced, I'm a researcher in medical imaging, and I want to get computers to support different aspects of patient care: diagnosis, for example, as well as improving treatment strategies, disease monitoring, and the like.
And I want to start motivating these shortcuts and biases from a slightly different perspective.
If you joined this stream yesterday, you probably heard about potential applications of AI in human resources. And if you follow the media in this area, you have probably also come across this case from some time ago: Amazon was experimenting with a system to assess job applications and automate interview decisions. However, when they investigated the system's decisions more closely, they found that it had a considerable bias against women. It is very likely that this behavior was based on prior hiring decisions, meaning an existing bias in the historical data of the last ten years was very likely amplified. And that is, of course, not something an AI is supposed to do when it is used in practice. Quite recently, there was
another report of an AI system that behaved unexpectedly. Studies showed that many face
recognition algorithms from different vendors depend very strongly on a participant's gender and skin color. Face recognition software often performs much worse, for example, for Black women than for white men. And these are just two examples.
There are many AI algorithms that behave differently in the lab compared to how they later behave in practical applications. So why do they fail? And why do they fail unexpectedly? Why is it sometimes so challenging to transfer very promising applications to the real world? This question is also relevant in my line of research. So
many recent publications deal with the question of robustness. How can we get our algorithms
to work across different patient populations, to work across different scanners, and to
work across different hospitals? I want to start with a simple example that highlights this very nicely. Imagine that you get a data set from a hospital for an application where you want to distinguish malignant from benign lesions. So you start developing
an algorithm. And what comes to mind when you see this data set is that the malignant lesions have a different color than the benign lesions. Many state-of-the-art AI approaches will also find color as a discriminative feature, because it is very easy for them to pick up. So we get a network that performs very well on this kind of data, we publish a paper, and we get very nice results. However,
when we then apply the network in practice, we see that it falls short of our expectations. So we investigate more closely, and we find that color is actually not the discriminative feature. On closer inspection, the feature that actually discriminates malignant from benign lesions is their shape. What is important here is that we could have picked up on this feature already in the first data set; it was just easier to find color. But color was a wrong correlation, a spurious correlation that held in this subset of our data. So without any additional knowledge, and given just this first data set, we can't really say the AI did anything wrong, because it basically followed the data. It simply found a shortcut that did not translate and did not generalize to a new setting.
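To make this concrete, here is a minimal toy sketch of such shortcut learning. This is not from the talk; the features, numbers, and names are invented for illustration. In the training data, color correlates almost perfectly with the label, while shape is the true but noisier signal:

```python
# Hypothetical sketch: a model latches onto a spurious "color" feature that
# is easier to learn than the true "shape" feature, then fails when the
# color correlation breaks at deployment.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, color_matches_label):
    y = rng.integers(0, 2, n)                # 0 = benign, 1 = malignant
    shape = y + rng.normal(0.0, 1.0, n)      # true feature, but noisy
    if color_matches_label:
        color = y + rng.normal(0.0, 0.1, n)  # spurious shortcut, almost noise-free
    else:
        color = rng.normal(0.5, 1.0, n)      # correlation broken at deployment
    return np.column_stack([color, shape]), y

X_lab, y_lab = make_data(1000, color_matches_label=True)
X_dep, y_dep = make_data(1000, color_matches_label=False)

clf = LogisticRegression().fit(X_lab, y_lab)
print("lab accuracy:       ", clf.score(X_lab, y_lab))  # looks excellent
print("deployment accuracy:", clf.score(X_dep, y_dep))  # falls short
print("weights [color, shape]:", clf.coef_[0])          # color dominates
```

In the lab, the classifier looks excellent because the nearly noise-free color feature dominates its weights; once that correlation breaks, performance drops, even though the true shape feature was available all along. Now, this might have been a relatively artificial example, but this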
can also happen in practice. In a recent study, researchers investigated how deep learning can be used for pneumonia detection in x-rays. They were able to show that networks can very easily pick up the small x-ray markers that are placed in the image and discriminate based on them. Now, these markers differ between hospitals, so the networks basically learn to differentiate where the images come from.
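One simple probe for this kind of leakage, sketched here on synthetic data (hypothetical code, not from the study), is to check whether a classifier can predict the source hospital from the images at all:

```python
# Hypothetical confounder check: if a simple classifier can tell which
# hospital an image comes from, the images carry site-specific cues that
# a diagnosis model could exploit as a shortcut. Data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def fake_xray(hospital, n=500, size=16):
    # Random noise stands in for anatomy; hospital "A" adds a bright
    # corner patch that simulates a site-specific x-ray marker.
    imgs = rng.normal(0.0, 1.0, (n, size, size))
    if hospital == "A":
        imgs[:, :3, :3] += 4.0
    return imgs.reshape(n, -1)

X = np.vstack([fake_xray("A"), fake_xray("B")])
site = np.array([0] * 500 + [1] * 500)  # 0 = hospital A, 1 = hospital B

X_tr, X_te, s_tr, s_te = train_test_split(X, site, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)

# Near-perfect accuracy means the images give away their origin.
print("site-prediction accuracy:", probe.score(X_te, s_te))
```

If this probe reaches near-perfect accuracy, the images reveal their origin, and any correlation between hospital and diagnosis becomes an available shortcut. And imagine that you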